Working with large-scale speech corpus for phonetic research: Pipeline and tools


Chenzi Xu
MPhil DPhil (Oxon)

University of Oxford
Workshop at Universiteit Leiden

August 11, 2025

About me

Leverhulme Trust Early Career Fellow

University of Oxford

The rise and fall of a tone


Postdoctoral Research Associate

University of York

Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR


DPhil, MPhil (Distinction),

University of Oxford

Why work with large corpora for phonetic research?


  • Capturing variation and change
    • Longitudinal and synchronic variation
    • Dialectology, sociophonetics, forensic phonetics
    • Growing availability of “in-the-wild” corpora
  • Statistical Power
  • Generalisability
  • Reproducibility
  • Advancing the phonetic toolset

Roadmap


Corpus Data Access


Know Your Device



5 important facts about your device or server

  • OS (operating system)
  • CPU (central processing unit): cores, threads
  • GPU (graphics processing unit): VRAM, CUDA
  • RAM (random-access memory)
  • Storage: free disk space
system_profiler SPHardwareDataType | head -n 10
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac15,11
      Model Number: MRW33B/A
      Chip: Apple M3 Max
      Total Number of Cores: 14 (10 performance and 4 efficiency)
      Memory: 36 GB
df -h /
Filesystem        Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/disk3s1s1   926Gi    10Gi   384Gi     3%    426k  4.0G    0%   /
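The same facts can be collected cross-platform from Python's standard library (GPU details still need vendor tools such as nvidia-smi):

```python
import os
import platform
import shutil

print(platform.system(), platform.release(), platform.machine())  # OS and architecture
print(os.cpu_count(), "logical CPU cores")                        # CPU threads
total, used, free = shutil.disk_usage("/")
print(f"disk: {free / 2**30:.0f} GiB free of {total / 2**30:.0f} GiB")
```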

Corpus Structure



Audio Formats

  • WAV
  • FLAC
  • MP3

Transcription Formats

  • Plain Text
  • TextGrid
  • ELAN
  • JSON

Corpus Management Ecosystems: Kaldi-style

  • File-based metadata system
    • Plain-text files: easy to read, edit, and version-control
    • Flexible: support rich and custom annotations
  • Broad tool support
    • Out-of-the-box Kaldi and ESPnet scripts for validation, filtering, splitting, and merging datasets



Corpus Management Ecosystems: Kaldi-style

  • File-based metadata system
    • wav.scp: maps recording/utterance IDs to audio paths
    • text: transcripts for each utterance
    • utt2spk and spk2utt: links utterances to speakers
    • segments (optional, for long recordings): start and end times for utterances

Conventions:

  • Each file is space‑separated and strictly sorted by the first field, with no duplicate IDs.
  • IDs are arbitrary strings, but must be consistent across files.
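These conventions can be checked mechanically; a minimal sketch in Python (the function name is illustrative):

```python
def check_kaldi_file(path):
    """Check one Kaldi metadata file: space-separated lines,
    strictly sorted by the first field, no duplicate IDs."""
    with open(path, encoding="utf-8") as f:
        ids = [line.split(maxsplit=1)[0] for line in f if line.strip()]
    assert ids == sorted(ids), f"{path}: not sorted by first field"
    assert len(ids) == len(set(ids)), f"{path}: duplicate IDs"
    return len(ids)
```

Kaldi's own utils/validate_data_dir.sh performs these checks (and more) across a whole data directory.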



KeSpeech

kespeech
├── audio
│   ├── phase1
│   └── phase2
└── metadata
    ├── city2Chinese
    ├── city2subdialect
    ├── phase1.text
    ├── phase1.utt2style
    ├── phase1.utt2subdialect
    ├── phase1.wav.scp
    ├── phase2.text
    ├── phase2.utt2env
    ├── phase2.wav.scp
    ├── spk2age
    ├── spk2city
    ├── spk2gender
    ├── spk2utt
    ├── subdialect2Chinese
    └── utt2spk


Corpus Management Ecosystems: Kaldi-style

  • Broad tool support
# Validation of datasets
utils/validate_data_dir.sh --no-feats kespeech/metadata

# Generate utterance durations
utils/data/get_utt2dur.sh kespeech/metadata

# Keep shortest/longest N for debugging
utils/subset_data_dir.sh --shortest kespeech/metadata 100 kespeech/ks_s100

# Random 10k-utterance subset (useful for quick experiments)
utils/subset_data_dir.sh kespeech/metadata 10000 kespeech/ks_10k

# Make dev/test by speakers (avoid speaker leakage)
utils/subset_data_dir_tr_cv_spk.sh --cv-spk-percent 15 kespeech/metadata kespeech/train kespeech/dev

ESPnet: End-to-end speech processing toolkit

ESPnet bundles the same Kaldi recipes and scripts.


Corpus Management Ecosystems: Hugging Face

  • Parquet/Arrow format metadata
    • A single, columnar file structure for audio and annotations
    • Cloud hosting and distribution via Hugging Face Hub
  • Wide ecosystem integration: Python API
    • Native streaming and loading from Hugging Face Hub
    • Built-in tools for filtering, batching, and audio decoding
    • Seamless compatibility with PyTorch, TensorFlow, and other ML frameworks



Common Voice

cv-corpus-22/yue
├── clip_durations.tsv
├── invalidated.tsv
├── other.tsv
├── reported.tsv
├── unvalidated_sentences.tsv
├── validated_sentences.tsv
└── validated.tsv
└── clips
    ├── common_voice_yue_31172849.mp3
    ├── common_voice_yue_31172850.mp3
    └── ...


Corpus Management Ecosystems: Hugging Face

  • Wide ecosystem integration: Python API
from datasets import load_dataset, Audio

# Stream the dataset on the fly (no full download)
cv_22_yue = load_dataset("fsicoli/common_voice_22_0", "yue", split="train", streaming=True)

# Decode audio at 16 kHz
cv_22_yue = cv_22_yue.cast_column("audio", Audio(sampling_rate=16_000))

# Remove utterances with very short transcripts
cv_22_yue = cv_22_yue.filter(lambda x: len(x["sentence"]) > 3)

# Make dev/test by speakers (avoid speaker leakage).
# train_test_split requires a regular (non-streaming) Dataset, and a
# speaker-disjoint split partitions client_id values rather than rows:
import random

cv_full = load_dataset("fsicoli/common_voice_22_0", "yue", split="train")
speakers = sorted(set(cv_full["client_id"]))
random.Random(42).shuffle(speakers)
test_spk = set(speakers[: int(0.15 * len(speakers))])

train_yue = cv_full.filter(lambda x: x["client_id"] not in test_spk)
test_yue  = cv_full.filter(lambda x: x["client_id"] in test_spk)

Data Preprocessing


Transcription



Open-source ASR toolkits


Pretrained model APIs


Cloud ASR APIs


Tutorial

A hands-on introductory tutorial on applying Whisper and wav2vec 2.0 is available here.

Transcription: Whisper


Advantages:

  • High accuracy; handles long-form audio natively
  • Multilingual: supports 50+ languages (trained on 98 languages)
  • Multi-tasking: transcription, translation, timestamping
  • Offline: privacy-friendly
  • Flexible: fine-tuneable

(Figure: WER comparison across Kaldi GigaSpeech XL, facebook/wav2vec2-large-robust-ft-libri-960h, and Whisper medium.en)

Transcription: Whisper


Weaknesses:

  • May hallucinate
  • Produces “(too) clean” output (disfluencies and hesitations are removed)
  • Large models can be GPU-heavy and slow
  • Accuracy depends on domain and language

(Figure: Whisper architecture)

Transcription: Advanced Whisper Techniques



Model configuration

  1. Provide language selection

  2. Initial prompt

  3. Dynamic temperature fallback

Segmentation strategies

4. Whisper-timestamped

5. VAD-guided chunking

# assumes: import whisper_timestamped as whisper
# ("vad" and "seed" below are whisper-timestamped extensions)
model_size = "large-v2"
language = "en"
task = "transcribe"
initial_prompt = "umm uhh oh ah hm er erm urgh mm"

transcribe_args = {
    "task": task,
    "language": language,
    "patience": None,
    "length_penalty": None,
    "suppress_tokens": "-1",
    "initial_prompt": initial_prompt,
    "fp16": False,
    "condition_on_previous_text": False,
    "vad": True,
    "best_of": 5,
    "beam_size": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}

result = whisper.transcribe(model, audio, seed=seed, **transcribe_args)

Transcription: Advanced Whisper Techniques



Segmentation strategies

  1. Whisper-timestamped
  2. VAD-guided chunking

pip3 install whisper-timestamped

# for Voice Activity Detection
pip3 install onnxruntime torchaudio


import whisper_timestamped as whisper

# reuse the same transcribe_args as on the previous slide; word-level
# timestamps and the "vad" option come from whisper-timestamped
model = whisper.load_model(model_size)
result = whisper.transcribe(model, audio, seed=seed, **transcribe_args)

Transcription: Adapting Whisper



Vanilla Fine Tuning

  • \(W_{\text{finetune}} = W_{\text{pretrained}} + \Delta W\)
  • All parameters are updated

  • Fine-tuning Whisper-large-v2 with as little as 20 h of data reduced the average WER by 54.94% across seven low-resource languages.

  • The Whisper-large-v2 model requires ~24 GB of GPU VRAM for full fine-tuning and ~7 GB of storage for each fine-tuned checkpoint.
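Those resource figures are roughly what the parameter count predicts; a back-of-the-envelope sketch (sizes approximate, Adam optimiser assumed):

```python
params = 1.55e9             # Whisper-large-v2 parameter count (approx.)
ckpt_gb = params * 4 / 1e9  # fp32 weights: 4 bytes per parameter
# full fine-tuning also holds gradients plus two Adam states per weight:
train_gb = ckpt_gb * (1 + 1 + 2)
print(f"checkpoint ≈ {ckpt_gb:.1f} GB, training state ≈ {train_gb:.1f} GB")
```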

Transcription: Adapting Whisper



Low-Rank Adaptation (LoRA)

  • Freeze the pre-trained weights
  • \(\Delta W = AB\) (a low-rank decomposition)

  • Lightweight training: often only 1–5% of Whisper’s parameters are updated
  • Reduced risk of catastrophic forgetting, since most pre-trained parameters are frozen
  • Great for domain shifts, new accents, and language varieties
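The parameter savings follow directly from the rank decomposition; for one \(d \times k\) weight matrix (dimensions and rank illustrative):

```python
d, k, r = 1280, 1280, 8      # attention matrix dims (illustrative), LoRA rank
full_params = d * k          # updated by vanilla fine-tuning
lora_params = d * r + r * k  # A is d x r, B is r x k
print(lora_params / full_params)  # ~1% of this matrix is trained
```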

Transcription: Adapting Whisper



Soft Prompt Tuning

  • Prepend \(m\) trainable token embeddings \(V = \{v_0, \ldots, v_{m-1}\}\) to the decoder input embeddings
  • Freeze \(\theta_{ASR}\); training updates only \(\theta_V\)
  • Generation: \(\hat{Y} = \operatorname*{argmax}_Y P(Y \mid X; \theta_{ASR}, \theta_V)\)


[v0 v1 ... v{m-1}] + [<sot> ... previous text tokens ...]
          │                          │
      trainable                 frozen decoder


  • Provides initial context that guides the generation process
  • Parameters of the pre-trained model remain unchanged
  • More parameter-efficient than fine-tuning
  • Great for style or domain shifts

  • Cannot fix missing acoustic/phonetic features

Transcription: Conversational data?



Diarisation + ASR:

Chaining pyannote and Whisper


Speaker Diarization: “Who is speaking and when?”

  • Obtaining transcription and diarisation segments
  • Matching transcription segments to speakers
  • Handling temporal mismatches
  • Merging consecutive segments from the same speaker

Python Demo:

class SpeakerAligner:
    def align(self, transcription, timestamps, diarization):
        speaker_transcriptions = []

        # Find the end time of the last segment in diarization
        last_diarization_end = self.get_last_segment(diarization).end

        for chunk in timestamps:
            chunk_start = chunk["timestamp"][0]
            chunk_end = chunk["timestamp"][1]
            segment_text = chunk["text"]

            # Handle the case where chunk_end is None
            if chunk_end is None:
                # Use the end of the last diarization segment as the default end time
                chunk_end = (
                    last_diarization_end
                    if last_diarization_end is not None
                    else chunk_start
                )

            # Find the best matching speaker segment
            best_match = self.find_best_match(diarization, chunk_start, chunk_end)
            if best_match:
                speaker = best_match[2]  # Extract the speaker label
                speaker_transcriptions.append(
                    (speaker, chunk_start, chunk_end, segment_text)
                )

        # Merge consecutive segments of the same speaker
        speaker_transcriptions = self.merge_consecutive_segments(speaker_transcriptions)
        return speaker_transcriptions

    def find_best_match(self, diarization, start_time, end_time):
        best_match = None
        max_intersection = 0

        for turn, _, speaker in diarization.itertracks(yield_label=True):
            turn_start = turn.start
            turn_end = turn.end

            # Calculate intersection manually
            intersection_start = max(start_time, turn_start)
            intersection_end = min(end_time, turn_end)

            if intersection_start < intersection_end:
                intersection_length = intersection_end - intersection_start
                if intersection_length > max_intersection:
                    max_intersection = intersection_length
                    best_match = (turn_start, turn_end, speaker)

        return best_match

    def merge_consecutive_segments(self, segments):
        merged_segments = []
        previous_segment = None

        for segment in segments:
            if previous_segment is None:
                previous_segment = segment
            else:
                if segment[0] == previous_segment[0]:
                    # Merge segments of the same speaker that are consecutive
                    previous_segment = (
                        previous_segment[0],
                        previous_segment[1],
                        segment[2],
                        previous_segment[3] + segment[3],
                    )
                else:
                    merged_segments.append(previous_segment)
                    previous_segment = segment

        if previous_segment:
            merged_segments.append(previous_segment)

        return merged_segments

    def get_last_segment(self, annotation):
        last_segment = None
        for segment in annotation.itersegments():
            last_segment = segment
        return last_segment

Transcription



Resources:

@awesome-whisper

@whisper-finetune

@fast-whisper-finetuning

Whisper Training Starter Kit

Fast Whisper-Large-v2 Fine-Tuning with LoRA

Whisper Precision: A Comprehensive Guide to Fine-Tuning and Hyperparameter Tuning

Fine-tuning Whisper on Low-Resource Languages for Real-World Applications

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

Preprocessing Procedure



Audio

  • Path normalisation
  • Audio format conversion
  • Segmenting long recordings
  • Diarisation


Metadata

  • Proportional balancing
  • Quality checks

Text

  • Encoding normalisation (UTF-8)
  • Whitespace cleanup
  • Character filtering
  • Length filtering
  • Tokenisation
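The text steps above can be sketched in a few lines (the regexes and length bounds are illustrative):

```python
import re
import unicodedata

def normalise(line, min_chars=2, max_chars=200):
    """Encoding normalisation, whitespace cleanup, character and
    length filtering for one transcript line (thresholds illustrative)."""
    line = unicodedata.normalize("NFC", line)        # canonical UTF-8 form
    line = re.sub(r"\s+", " ", line).strip()         # collapse whitespace
    line = re.sub(r"[^\w\s']", "", line)             # keep word chars (incl. CJK)
    if not (min_chars <= len(line) <= max_chars):    # drop too short/long lines
        return None
    return line

print(normalise("  Hello,   world!! "))   # → "Hello world"
```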



Data Querying


Basic Querying Methods: Filtering and Flagging



Kaldi-style

  • Unix command-line tools are key
    • sed
    • grep
    • awk
  • Regular expressions
  • Utility functions such as utils/subset_data_dir.sh

Hugging Face

  • Pandas-like filtering:
    • .filter()
    • .select()
    • .map()
  • Combining with lambda functions or boolean masks
  • Row indices and column names
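For instance, flagging utterances that have audio but no transcript takes one pipeline over the Kaldi metadata files (toy files created here; IDs and paths illustrative):

```shell
# toy Kaldi-style files (IDs and paths illustrative)
printf 'spk001_utt1 /data/a.wav\nspk001_utt2 /data/b.wav\n' > wav.scp
printf 'spk001_utt1 hello world\n' > text

# flag utterance IDs present in wav.scp but missing from text
awk '{print $1}' wav.scp | sort > wav_ids
awk '{print $1}' text    | sort > text_ids
comm -23 wav_ids text_ids    # prints spk001_utt2
```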

Speech Unit Retrieval I



Tabular TextGrid + Text Processing


Example: .TextGrid

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0125 
xmax = 1.8725 
tiers? <exists> 
size = 2 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phone" 
        xmin = 0.0125 
        xmax = 1.8725 
        intervals: size = 26 
        intervals [1]:
            xmin = 0.0125 
            xmax = 0.0925 
            text = "t" 

words.txt

妈,0.0125,0.3325,0.3200,b01_1_101q
妈,0.3325,0.5125,0.1800,b01_1_101q
们,0.5125,0.7925,0.2800,b01_1_101q
正,0.7925,1.1125,0.3200,b01_1_101q
看,1.1125,1.6725,0.5600,b01_1_101q
...

Speech Unit Retrieval I



Tabular TextGrid + Text Processing


Example:

Create a find_de.awk snippet to find disyllabic phrases in Mandarin that end with “的”.

BEGIN { FS = "," }
$1 == "的" && prev != "" {
    pair = prev " " $0
    gsub(/,/, " ", pair)   # space-separate the fields, as in the output below
    print pair
}
{ prev = $0 }

Run:

awk -f find_de.awk words.txt > de_phrases.txt


Disyllabic phrases extracted:

大 0.0125 0.1225 0.1100 b08_1_114a 的 0.1225 0.2060 0.0835 b08_1_114a
搭 0.2351 0.3725 0.1374 b02_1_119q 的 0.3725 0.4638 0.0913 b02_1_119q
...

Speech Unit Retrieval II



The Pythonic way + LLMs

Python interface to Praat TextGrid

  • tgt
  • praatio
  • textgrid
  • parselmouth
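Libraries such as tgt or praatio parse TextGrids properly; as a self-contained sketch, interval extraction from the long-format ooTextFile shown earlier can be done with a regex (handles only that simple case):

```python
import re

# excerpt in the long ooTextFile layout (values illustrative)
sample = '''        intervals [1]:
            xmin = 0.0125 
            xmax = 0.0925 
            text = "t" 
        intervals [2]:
            xmin = 0.0925
            xmax = 0.3325
            text = "a"
'''

pattern = re.compile(
    r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')
intervals = [(float(a), float(b), t) for a, b, t in pattern.findall(sample)]
print(intervals)   # [(0.0125, 0.0925, 't'), (0.0925, 0.3325, 'a')]
```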

Large Language Models

  • Produce syntactic and semantic parses
  • Disambiguate meanings in context
  • Flexible with few-shot learning and prompting

  • Most LLM APIs are not free
  • May hallucinate

Speech Unit Retrieval II



The Pythonic way + LLMs

Example:

Neutral tone pairs in Mandarin
index  pos  word  pinyin  meaning  group
1  N  地方  dìfang  某一区域、空间的一部分、部位 (a region; part of a space; a place)  A
1  N  地方  dìfāng  中央下属的各级行政区划的统称,本地、当地 (local administrative divisions; local)  A
2  N  地下  dìxia  指地面上 (on the ground/floor)  A
2  N  地下  dìxià  指地面下或秘密的 (underground, or covert)  A
3  N  东西  dōngxi  泛指各种事物,特指人或动物 (things in general; a person or animal)  A
3  N  东西  dōngxī  指东和西两个方向 (the directions east and west)  A

Speech Unit Retrieval II



The Pythonic way + LLMs


System:

Set high-level instructions and context


User:

The actual query or input


Assistant’s Reply:

The model’s generated output

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_msg = (
    "You are a linguist specializing in Chinese semantics. "
    "Given a sentence and a target word with multiple meanings, "
    "your job is to identify the most appropriate meaning from the candidate list."
)

user_msg = (
    f"Sentence: {text}\n"
    f"Target word: {word}\n"
    f"Candidate meanings:\n"
    + "\n".join(f"{i+1}. {c}" for i, c in enumerate(candidate_strings))
    + "\n\nWhich meaning fits best? Respond only with the number (e.g., 1, 2) or 'None'. "
)

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.2,
)

answer = response.choices[0].message.content.strip()  # e.g. "1", "2", or "None"

Speech Unit Retrieval II



The Pythonic way + LLMs

Example output: (figure in the original slides)

Data Processing and Analysis


Acoustic Measurement



Praat series

  • Well-documented and actively maintained
  • Scriptable for large-scale batch analysis
  • Compatibility with Python and R via @Parselmouth and @rPraat
  • Graphical interface-friendly

  • Limited built-in batch processing unless scripted

Acoustic Measurement

1 2 3 4


openSMILE

  • Extracts hundreds of acoustic features
  • Configurable feature sets (e.g., eGeMAPS, ComParE).
  • Fast for large corpora
  • Command line-friendly

  • Steeper learning curve for custom config files
  • Some formant estimates can be unstable (e.g., formant measurements for telephone recordings need to be checked)

Acoustic Measurement



VoiceSauce

  • Specialised in voice quality measures such as H1–H2, spectral tilt, and cepstral peak prominence.
  • Integrates multiple algorithms for pitch/formant tracking (Praat, Snack, STRAIGHT).

  • MATLAB-based and not actively maintained (some functions are incompatible with newer versions of MATLAB)
  • Slow processing on large corpora
  • Limited general-purpose acoustic features compared to openSMILE

Acoustic Measurement



Factors influencing acoustic measures

  • Telecommunication channels (e.g., VoIP technologies)
  • Voice quality (e.g., breathy and creaky voice)
  • Recording environment (e.g., music, background noise)



Build Your Anomaly Flagging System



  • Compute summary statistics on the fly and flag anomalies for review
    • Deviations from mean or median (> k × SD)
    • Extremely high or low values
    • Abrupt value shifts (e.g., F0 jumps)
  • Select a small portion for auditory evaluation
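The flagging rules above can be sketched as follows (thresholds illustrative):

```python
from statistics import mean, stdev

def flag_anomalies(f0, k=3.0, jump=80.0):
    """Flag F0 frames more than k SDs from the mean, and abrupt
    frame-to-frame jumps larger than `jump` Hz (thresholds illustrative)."""
    m, s = mean(f0), stdev(f0)
    flags = set()
    for i, v in enumerate(f0):
        if abs(v - m) > k * s:                       # extreme value
            flags.add(i)
        if i > 0 and abs(v - f0[i - 1]) > jump:      # abrupt shift
            flags.update((i - 1, i))
    return sorted(flags)

print(flag_anomalies([120, 122, 119, 121, 320, 118]))  # → [3, 4, 5]
```

The flagged indices mark the jump into and out of the 320 Hz frame; frames flagged this way can then feed the small auditory-evaluation sample.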

Thank you ⭐️

@ChenziAmy
@chenchenzi
chenzixu.rbind.io